Federated Survival Forests
Survival analysis is a subfield of statistics concerned with modeling the occurrence time of a particular event of interest for a population. Survival analysis has found widespread applications in healthcare, engineering, and social sciences. However, real-world applications involve survival datasets that are distributed, incomplete, censored, and confidential. In this context, federated learning can tremendously improve the performance of survival analysis applications. Federated learning provides a set of privacy-preserving techniques to jointly train machine learning models on multiple datasets without compromising user privacy, leading to better generalization performance. However, despite the widespread development of federated learning in recent AI research, few studies focus on federated survival analysis. In this work, we present a novel federated algorithm for survival analysis based on one of the most successful survival models, the random survival forest. We call the proposed method Federated Survival Forest (FedSurF). With a single communication round, FedSurF obtains a discriminative power comparable to deep-learning-based federated models trained over hundreds of federated iterations. Moreover, FedSurF retains all the advantages of random forests, namely low computational cost and natural handling of missing values and incomplete datasets. These advantages are especially desirable in real-world federated environments with multiple small datasets stored on devices with low computational capabilities. Numerical experiments compare FedSurF with state-of-the-art survival models in federated networks, showing how FedSurF outperforms deep-learning-based federated algorithms in realistic environments with non-identically distributed data
Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics
Survival analysis is a fundamental tool in medicine, modeling the time until
an event of interest occurs in a population. However, in real-world
applications, survival data are often incomplete, censored, distributed, and
confidential, especially in healthcare settings where privacy is critical. The
scarcity of data can severely limit the scalability of survival models to
distributed applications that rely on large data pools. Federated learning is a
promising technique that enables machine learning models to be trained on
multiple datasets without compromising user privacy, making it particularly
well-suited for addressing the challenges of survival data and large-scale
survival applications. Despite significant developments in federated learning
for classification and regression, many directions remain unexplored in the
context of survival analysis. In this work, we propose an extension of the
Federated Survival Forest algorithm, called FedSurF++. This federated ensemble
method constructs random survival forests in heterogeneous federations.
Specifically, we investigate several new tree sampling methods from client
forests and compare the results with state-of-the-art survival models based on
neural networks. The key advantage of FedSurF++ is its ability to achieve
comparable performance to existing methods while requiring only a single
communication round to complete. The extensive empirical investigation results
in a significant improvement from the algorithmic and privacy preservation
perspectives, making the original FedSurF algorithm more efficient, robust, and
private. We also present results on two real-world datasets demonstrating the
success of FedSurF++ in real-world healthcare studies. Our results underscore
the potential of FedSurF++ to improve the scalability and effectiveness of
survival analysis in distributed settings while preserving user privacy
Heterogeneous Datasets for Federated Survival Analysis Simulation
This repo contains three algorithms for constructing realistic federated datasets for survival analysis. Each algorithm starts from an existing non-federated dataset and assigns each sample to a specific client in the federation. The algorithms are:
uniform_split: assigns each sample to a random client with uniform probability;
quantity_skewed_split: assigns each sample to a random client according to the Dirichlet distribution [3, 4];
label_skewed_split: assigns each sample to a time bin, then assigns a set of samples from each bin to the clients according to the Dirichlet distribution [3, 4].
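The Dirichlet-based strategies listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the helper names dirichlet_quantity_split and dirichlet_label_split, the alpha, num_bins, and seed parameters, and the choice of quantile-based time bins are all hypothetical.

```python
import numpy as np


def dirichlet_quantity_split(num_samples, num_clients, alpha=1.0, seed=0):
    """Hypothetical helper: draw client proportions from a symmetric
    Dirichlet(alpha), then assign each sample index to a random client
    according to those proportions."""
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    assignment = rng.choice(num_clients, size=num_samples, p=proportions)
    return [np.flatnonzero(assignment == c) for c in range(num_clients)]


def dirichlet_label_split(times, num_clients, num_bins=10, alpha=1.0, seed=0):
    """Hypothetical helper: bin samples by event/censoring time (quantile
    bins are an assumption), then split each bin across clients using a
    separate Dirichlet draw per bin."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    # Interior quantile edges give num_bins bins labelled 0..num_bins-1.
    edges = np.quantile(times, np.linspace(0, 1, num_bins + 1)[1:-1])
    bins = np.digitize(times, edges)
    client_indices = [[] for _ in range(num_clients)]
    for b in range(num_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size == 0:
            continue
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        assignment = rng.choice(num_clients, size=idx.size, p=proportions)
        for c in range(num_clients):
            client_indices[c].extend(idx[assignment == c].tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]
```

In this sketch, a smaller alpha concentrates the Dirichlet draw on fewer clients, so the resulting federation is more heterogeneous; a large alpha approaches the uniform split.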
For more information, please take a look at our paper at https://arxiv.org/abs/2301.12166 [1].
Content
federated_survival_datasets.zip: the content of the repository at https://github.com/archettialberto/federated_survival_datasets
Heterogheneous_Datasets_for_Federated_Survival_Analysis_Simulation.pdf: the conference paper describing the work.
Installation
Federated Survival Datasets is built on top of numpy and scikit-learn. To install those libraries you can run pip install -r requirements.txt. To import survival datasets into your project, we strongly recommend SurvSet (https://github.com/ErikinBC/SurvSet) [2], a comprehensive collection of more than 70 survival datasets.
Usage
import numpy as np
import pandas as pd
from federated_survival_datasets import label_skewed_split
# import a survival dataset and extract the input array X and the output array y
df = pd.read_csv("metabric.csv")
X = df[[f"x{i}" for i in range(9)]].to_numpy()
y = np.array([(e, t) for e, t in zip(df["event"], df["time"])], dtype=[("event", bool), ("time", float)])
# run the splitting algorithm
client_data = label_skewed_split(num_clients=8, X=X, y=y)
# check the number of samples assigned to each client
for i, (X_c, y_c) in enumerate(client_data):
    print(f"Client {i} - X: {X_c.shape}, y: {y_c.shape}")
We provide an example notebook in the zipped folder to illustrate the proposed algorithms. It requires scikit-survival, seaborn, and pandas.
References
[1] Archetti, A., Lomurno, E., Lattari, F., Martin, A., & Matteucci, M. (2023). Heterogeneous Datasets for Federated Survival Analysis Simulation. arXiv preprint arXiv:2301.12166.
[2] Drysdale, E. (2022). SurvSet: An open-source time-to-event dataset repository. arXiv preprint arXiv:2203.03094.
[3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
[4] Li, Q., Diao, Y., Chen, Q., & He, B. (2022, May). Federated learning on non-iid data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE) (pp. 965-978). IEEE
The Bi-objective Long-haul Transportation Problem on a Road Network
In this paper we study a long-haul truck scheduling problem where a path has
to be determined for a vehicle traveling from a specified origin to a specified
destination. We consider refueling decisions along the path, while accounting
for heterogeneous fuel prices in a road network. Furthermore, the path has to
comply with Hours of Service (HoS) regulations. Therefore, a path is defined by
the actual road trajectory traveled by the vehicle, as well as the locations
where the vehicle stops due to refueling, compliance with HoS regulations, or a
combination of the two. This setting is cast in a bi-objective optimization
problem, considering the minimization of fuel cost and the minimization of path
duration. An algorithm is proposed to solve the problem on a road network. The
algorithm builds a set of non-dominated paths with respect to the two
objectives. Given the enormous theoretical size of the road network, the
algorithm follows an interactive path construction mechanism. Specifically, the
algorithm dynamically interacts with a geographic information system to
identify the relevant potential paths and stop locations. Computational tests
are made on real-sized instances where the distance covered ranges from 500 to
1500 km. The algorithm is compared with solutions obtained from a policy
mimicking the current practice of a logistics company. The results show that
the non-dominated solutions produced by the algorithm significantly dominate
the ones generated by the current practice, in terms of fuel costs, while
achieving similar path durations. The average number of non-dominated paths is
2.7, which allows decision makers to ultimately visually inspect the proposed
alternatives
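The notion of a set of non-dominated paths with respect to fuel cost and duration can be illustrated with a small filter. This is only a sketch of Pareto filtering over already-enumerated candidates, not the interactive path construction algorithm of the paper; the function name non_dominated and the (fuel_cost, duration) tuple representation are assumptions.

```python
from typing import List, Tuple


def non_dominated(paths: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep the (fuel_cost, duration) pairs that no other pair dominates,
    where both objectives are minimized."""
    front = []
    best_duration = float("inf")
    # Sort by cost, then duration: a candidate survives only if it is
    # strictly faster than every candidate that costs no more than it.
    for cost, duration in sorted(set(paths)):
        if duration < best_duration:
            front.append((cost, duration))
            best_duration = duration
    return front


candidates = [(100, 10), (90, 12), (95, 11), (110, 9), (100, 12)]
print(non_dominated(candidates))
# → [(90, 12), (95, 11), (100, 10), (110, 9)]
```

A small front, such as the average of 2.7 paths reported above, is what makes visual inspection of the alternatives practical.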
SGDE: Secure Generative Data Exchange for Cross-Silo Federated Learning
Privacy regulation laws, such as GDPR, impose transparency and security as
design pillars for data processing algorithms. In this context, federated
learning is one of the most influential frameworks for privacy-preserving
distributed machine learning, achieving astounding results in many natural
language processing and computer vision tasks. Several federated learning
frameworks employ differential privacy to prevent private data leakage to
unauthorized parties and malicious attackers. Many studies, however, highlight
the vulnerabilities of standard federated learning to poisoning and inference,
thus raising concerns about potential risks for sensitive data. To address this
issue, we present SGDE, a generative data exchange protocol that improves user
security and machine learning performance in a cross-silo federation. The core
of SGDE is to share data generators with strong differential privacy guarantees
trained on private data instead of communicating explicit gradient information.
These generators synthesize an arbitrarily large amount of data that retain the
distinctive features of private samples but differ substantially. In this work,
SGDE is tested in a cross-silo federated network on image and tabular
datasets, exploiting beta-variational autoencoders as data generators. The
results show that including SGDE improves task accuracy and fairness,
as well as resilience to the most influential attacks on federated learning
Towards cross-cohort estimation of cognitive decline in neurodegenerative diseases
Heterogeneity of cohorts, in terms of inclusion criteria, design of follow-up visits and batteries of cognitive assessments, hinders any thorough comparison between them. For that reason, we build a cross-cohort model of cognitive decline that can be personalized to any patient, allowing partially or totally missing scores to be imputed. This makes it possible to compare, at the individual level, the disease progression of subjects from different cohorts, with a temporal realignment and with respect to a broader set of biomarkers
Differences Between Plasma and Cerebrospinal Fluid p-tau181 and p-tau231 in Early Alzheimer's Disease
Plasma phosphorylated tau species have been recently proposed as peripheral markers of Alzheimer's disease (AD) pathology. In this cross-sectional study including 91 subjects, plasma and cerebrospinal fluid (CSF) p-tau181 and p-tau231 levels were elevated in the early symptomatic stages of AD. Plasma p-tau231 and p-tau181 were strongly related to CSF phosphorylated tau, total tau and amyloid, and exhibited a high accuracy, close to that of CSF p-tau231 and p-tau181, in identifying AD already in the early stage of the disease. The findings might support their use as diagnostic and prognostic peripheral AD biomarkers in both research and clinical settings
Rare mutations in SQSTM1 modify susceptibility to frontotemporal lobar degeneration
Mutations in the gene coding for Sequestosome 1 (SQSTM1) have been genetically associated with amyotrophic lateral sclerosis (ALS) and Paget disease of bone. In the present study, we analyzed the SQSTM1 coding sequence for mutations in an extended cohort of 1,808 patients with frontotemporal lobar degeneration (FTLD), ascertained within the European Early-Onset Dementia consortium. As a control dataset, we sequenced 1,625 European control individuals and analyzed whole-exome sequence data of 2,274 German individuals (total n = 3,899). Association of rare SQSTM1 mutations was calculated in a meta-analysis of 4,332 FTLD and 10,240 control alleles. We identified 25 coding variants in FTLD patients, of which 10 have not been described. Fifteen mutations were absent in the control individuals (carrier frequency < 0.00026) whilst the others were rare in both patients and control individuals. When pooling all variants with a minor allele frequency < 0.01, an overall frequency of 3.2 % was calculated in patients. Rare variant association analysis between patients and controls showed no difference over the whole protein, but suggested that rare mutations clustering in the UBA domain of SQSTM1 may influence disease susceptibility by doubling the risk for FTLD (RR = 2.18 [95 % CI 1.24-3.85]; corrected p value = 0.042). Detailed histopathology demonstrated that mutations in SQSTM1 associate with widespread neuronal and glial phospho-TDP-43 pathology. With this study, we provide further evidence for a putative role of rare mutations in SQSTM1 in the genetic etiology of FTLD and show that, comparable to other FTLD/ALS genes, SQSTM1 mutations are associated with TDP-43 pathology